In this part of the project, I will conduct an exploratory data analysis of the Prosper Loan Data. I will explore the dataset’s variables and understand the data’s structure, oddities, patterns and relationships. The analysis will go from simple univariate relationships up through multivariate relationships.
The dataset contains about 114,000 loans with 81 variables on each loan, which include loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
# load in the dataset into a pandas dataframe
prosper = pd.read_csv('prosperLoanData.csv')
# overview of data shape and composition
prosper.shape
# general overview of the dataset
prosper.info()
# check some samples of the dataset
prosper.sample(5)
LoanOriginationDate and ListingCreationDate seem to be less than a month apart.
# comparing LoanOriginationDate and ListingCreationDate
prosper[['LoanOriginationDate', 'ListingCreationDate']].head()
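The gap between the two dates can also be measured rather than read off a few rows. The sketch below uses a made-up two-row frame (`toy`) in place of the real data, and the column `DaysToOrigination` is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

# Toy rows standing in for the real columns (values are made up for illustration)
toy = pd.DataFrame({
    'ListingCreationDate': ['2014-02-27 08:28:07', '2007-01-05 15:00:47'],
    'LoanOriginationDate': ['2014-03-03 00:00:00', '2007-01-17 00:00:00'],
})

# Parse both columns as datetimes, then subtract to get the gap in whole days
created = pd.to_datetime(toy['ListingCreationDate'])
originated = pd.to_datetime(toy['LoanOriginationDate'])
toy['DaysToOrigination'] = (originated - created).dt.days  # hypothetical column name

# Summary statistics show whether the gap really stays under a month
print(toy['DaysToOrigination'].describe())
```

Running the same subtraction on the full dataset would show how many loans, if any, took longer than a month to originate.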
#checking for consistency between CreditScoreRangeUpper and CreditScoreRangeLower
prosper[['CreditScoreRangeUpper','CreditScoreRangeLower']].sample(10)
prosper['CreditScoreRangeUpper'].max()
Since CreditScoreRangeUpper is 19 points above CreditScoreRangeLower in the sampled rows, the two columns carry the same information and I will keep just one of them (CreditScoreRangeUpper).
# Select only columns of interest.
cols = ['LoanOriginalAmount','BorrowerAPR','BorrowerRate','DebtToIncomeRatio','StatedMonthlyIncome','CreditScoreRangeUpper',
'Term','ListingCategory (numeric)', 'BorrowerState', 'LoanOriginationDate', 'LoanOriginationQuarter','Occupation',
'EmploymentStatus','MonthlyLoanPayment','Investors','InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount',
'Recommendations','IsBorrowerHomeowner']
prosper = prosper[cols]
prosper.info()
# Drop rows with missing BorrowerAPR and CreditScoreRangeUpper
prosper = prosper[prosper['BorrowerAPR'].notnull()]
prosper = prosper[prosper['CreditScoreRangeUpper'].notnull()]
# check for duplicated entries
prosper.duplicated().sum()
# drop duplicates in the dataset
prosper = prosper.drop_duplicates()
# Test for confirmation
prosper.duplicated().sum()
# filling missing values in the Occupation and BorrowerState as NotAvailable
prosper.Occupation = prosper.Occupation.fillna('NotAvailable')
prosper.BorrowerState = prosper.BorrowerState.fillna('NotAvailable')
# check the values of EmploymentStatus
prosper.EmploymentStatus.value_counts()
# change all 'Full-time' EmploymentStatus values to 'Employed' and
# fill missing values with 'Not available'
# (assigning the result back avoids in-place edits on a column view, which newer pandas versions warn about)
prosper['EmploymentStatus'] = prosper['EmploymentStatus'].replace('Full-time', 'Employed')
prosper['EmploymentStatus'] = prosper['EmploymentStatus'].fillna('Not available')
# test for confirmation
prosper.EmploymentStatus.value_counts()
# Splitting LoanOriginationQuarter column into Quarter and Year. Also, extracting Month from LoanOriginationDate
prosper['Month'] = prosper['LoanOriginationDate'].apply(lambda x: x.split("-")[1]).astype(str)
prosper['Quarter'] = prosper['LoanOriginationQuarter'].apply(lambda x: x.split(" ")[0]).astype(str)
prosper['Year'] = prosper['LoanOriginationQuarter'].apply(lambda x: x.split(" ")[1]).astype(str)
# unique values of Year
prosper.Year.unique()
# unique values of Quarter
prosper.Quarter.unique()
# unique values of Month
prosper.Month.unique()
# replacing numerical values of Month with names (assigned back instead of using inplace on a column view)
prosper['Month'] = prosper['Month'].replace(['01','02','03','04','05','06','07','08','09','10','11','12'],
                                            ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sept','Oct','Nov','Dec'])
# testing
prosper.Month.unique()
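As an alternative to splitting strings, the month, quarter and year can all be read from a parsed datetime. This is a minimal sketch on a made-up frame (`toy`). It assumes the origination dates parse cleanly with `pd.to_datetime`, and it reuses the notebook's month abbreviations (including 'Sept') so the labels stay consistent.

```python
import pandas as pd

# Made-up origination dates, used only to illustrate the approach
toy = pd.DataFrame({'LoanOriginationDate': ['2014-03-03 00:00:00',
                                            '2009-09-17 00:00:00',
                                            '2012-12-01 00:00:00']})

month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec']
number_to_name = dict(enumerate(month_names, start=1))  # 1 -> 'Jan', ..., 12 -> 'Dec'

dates = pd.to_datetime(toy['LoanOriginationDate'])
toy['Month'] = dates.dt.month.map(number_to_name)
toy['Quarter'] = 'Q' + dates.dt.quarter.astype(str)
toy['Year'] = dates.dt.year.astype(str)
print(toy)
```

Deriving all three fields from one parsed column avoids relying on the exact text layout of LoanOriginationQuarter.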
# filling missing values in DebtToIncomeRatio column with its mean
prosper.DebtToIncomeRatio = prosper.DebtToIncomeRatio.fillna(prosper.DebtToIncomeRatio.mean())
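Filling with the mean puts every previously missing row on a single value, which can show up later as a spike in the histogram. One way to keep track of this is to flag the rows before filling. The sketch below works on a made-up column, and `DebtToIncomeImputed` is a hypothetical column name that is not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up ratios with two missing entries
toy = pd.DataFrame({'DebtToIncomeRatio': [0.10, 0.20, np.nan, 0.30, np.nan]})

# Record which rows are missing BEFORE filling, so they can be separated later
toy['DebtToIncomeImputed'] = toy['DebtToIncomeRatio'].isna()  # hypothetical flag column

# Fill with the mean of the observed values, as in the cell above
fill_value = toy['DebtToIncomeRatio'].mean()
toy['DebtToIncomeRatio'] = toy['DebtToIncomeRatio'].fillna(fill_value)

print(toy)
print('rows filled:', int(toy['DebtToIncomeImputed'].sum()))
```

With the flag in place, a histogram can be drawn with and without the filled rows to see how much of any spike they explain.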
# create ListingCategory column and assign actual value to the numerical values
conditions = [(prosper['ListingCategory (numeric)'] == 0), (prosper['ListingCategory (numeric)'] == 1), (prosper['ListingCategory (numeric)'] == 2),
(prosper['ListingCategory (numeric)'] == 3), (prosper['ListingCategory (numeric)'] == 4), (prosper['ListingCategory (numeric)'] == 5),
(prosper['ListingCategory (numeric)'] == 6), (prosper['ListingCategory (numeric)'] == 7), (prosper['ListingCategory (numeric)'] == 8),
(prosper['ListingCategory (numeric)'] == 9), (prosper['ListingCategory (numeric)'] == 10), (prosper['ListingCategory (numeric)'] == 11),
(prosper['ListingCategory (numeric)'] == 12), (prosper['ListingCategory (numeric)'] == 13), (prosper['ListingCategory (numeric)'] == 14),
(prosper['ListingCategory (numeric)'] == 15), (prosper['ListingCategory (numeric)'] == 16), (prosper['ListingCategory (numeric)'] == 17),
(prosper['ListingCategory (numeric)'] == 18), (prosper['ListingCategory (numeric)'] == 19), (prosper['ListingCategory (numeric)'] == 20)]
values = ['Not Available','Debt Consolidation','Home Improvement','Business','Personal Loan','Student Use','Auto','Other',
'Baby&Adoption','Boat','Cosmetic Procedure','Engagement Ring','Green Loans','Household Expenses','Large Purchases',
'Medical/Dental','Motorcycle','RV','Taxes','Vacation','Wedding Loans']
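# give np.select a string default: recent NumPy versions reject the implicit integer default 0 for string choices
prosper['ListingCategory'] = np.select(conditions, values, default = 'Not Available')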
# Testing
prosper['ListingCategory'].value_counts()
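The same code-to-label assignment can be written more compactly with a dictionary and `Series.map`. This is a sketch on a made-up column that uses only the first four labels from the list above. Note that `map` leaves any code missing from the dictionary as NaN rather than assigning a default.

```python
import pandas as pd

# First four labels only, in code order (0, 1, 2, 3), to keep the example short
labels = ['Not Available', 'Debt Consolidation', 'Home Improvement', 'Business']
code_to_label = dict(enumerate(labels))  # {0: 'Not Available', 1: 'Debt Consolidation', ...}

# Made-up numeric codes
toy = pd.DataFrame({'ListingCategory (numeric)': [1, 0, 3, 1, 2]})
toy['ListingCategory'] = toy['ListingCategory (numeric)'].map(code_to_label)
print(toy)
```

Because the labels are listed in code order, `enumerate` builds the lookup without writing one condition per category.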
The FICO Score, which is the most widely used credit scoring model, falls in a range that goes up to 850. The lowest credit score in this range is 300.
# check for Credit Score less than 300
prosper[prosper['CreditScoreRangeUpper'] < 300].count()
# drop all rows with credit score less than 300
prosper = prosper[prosper['CreditScoreRangeUpper'] >= 300]
According to the FICO Score ranges, 300 - 579 is poor, 580 - 669 is fair, 670 - 739 is good, 740 - 799 is very good, and 800 and above is excellent.
# create CreditScore column and assign grades to the numerical values
scores = [(prosper['CreditScoreRangeUpper'] >= 300) & (prosper['CreditScoreRangeUpper'] < 580),
(prosper['CreditScoreRangeUpper'] >= 580) & (prosper['CreditScoreRangeUpper'] < 670),
(prosper['CreditScoreRangeUpper'] >= 670) & (prosper['CreditScoreRangeUpper'] < 740),
(prosper['CreditScoreRangeUpper'] >= 740) & (prosper['CreditScoreRangeUpper'] < 800),
(prosper['CreditScoreRangeUpper'] >= 800)]
grades = ['Poor', 'Fair', 'Good', 'Very good', 'Excellent']
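# a string default is required by recent NumPy versions; no row should fall through,
# because missing scores and scores below 300 were dropped above
prosper['CreditScore'] = np.select(scores, grades, default = 'Unknown')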
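The banding above can also be expressed with `pd.cut`, which takes the band edges directly. This is a minimal sketch on made-up scores chosen to sit on the band boundaries. With `right=False` each band includes its lower edge and excludes its upper edge, which matches the `>=` and `<` comparisons used above.

```python
import numpy as np
import pandas as pd

# Made-up scores sitting exactly on the band boundaries
toy = pd.DataFrame({'CreditScoreRangeUpper': [300, 579, 580, 669, 670,
                                              739, 740, 799, 800, 850]})

edges = [300, 580, 670, 740, 800, np.inf]   # lower edge of each band, open-ended at the top
grades = ['Poor', 'Fair', 'Good', 'Very good', 'Excellent']

# right=False makes the intervals [300, 580), [580, 670), ..., [800, inf)
toy['CreditScore'] = pd.cut(toy['CreditScoreRangeUpper'], bins=edges,
                            labels=grades, right=False)
print(toy)
```

`pd.cut` returns an ordered categorical by default, so the grades already sort from Poor to Excellent.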
# convert EmploymentStatus, Year, Quarter, Month and CreditScore into ordered categorical types
ordinal_var_dict = {'EmploymentStatus': ['Self-employed','Employed','Part-time','Retired','Other','Not employed','Not available'],
'Year': ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014'],
'Quarter': ['Q1', 'Q2', 'Q3','Q4'],
'Month': ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sept','Oct','Nov','Dec'],
'CreditScore': ['Poor','Fair','Good','Very good','Excellent']
}
for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[var])
    prosper[var] = prosper[var].astype(ordered_var)
# count values
prosper['CreditScore'].value_counts()
# drop the columns that are no longer required for the analysis
prosper.drop(['LoanOriginationQuarter', 'ListingCategory (numeric)', 'CreditScoreRangeUpper'], axis=1, inplace = True)
In this section, I will explore the dataset by creating visualizations.
First, let's check some properties of the dataset
prosper.info()
prosper.shape
prosper.describe()
prosper.sample(10)
There are 112,342 loans in the dataset with 21 features. Some are categorical, such as ListingCategory, BorrowerState and Occupation, while others are numerical/quantitative.
I am interested in determining which factor(s) contribute to a good FICO (credit) score, which might in turn determine the accessibility and amount of a loan from Prosper.
I would also like to check which categories of people take the most loans.
I expect that a borrower with higher income should have a good credit score and have access to a larger loan. Also, the purpose of the loan should be a determining factor in whether the loan is granted.
In this section, I will investigate distributions of individual variables. If there are any unusual points or outliers, I will take a deeper look to clean things up and prepare myself to look at relationships between variables.
I will start with the distribution of the loan amount
binsize = 500
bins = np.arange(0, prosper['LoanOriginalAmount'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = prosper, x = 'LoanOriginalAmount', bins = bins)
plt.xlabel('Amount ($)')
plt.show()
To get a clearer picture, I will plot LoanOriginalAmount on a log scale.
log_binsize = 0.025
bins = 10 ** np.arange(3.0, np.log10(prosper['LoanOriginalAmount'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = prosper, x = 'LoanOriginalAmount', bins = bins)
plt.xscale('log')
plt.xticks([500, 1e3, 2e3, 5e3, 1e4, 2e4, 3e4, 4e4], [500, '1k', '2k', '5k', '10k', '20k', '30k', '40k'])
plt.xlabel('Loan Amount ($)')
plt.show()
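The log-scaled histogram relies on bin edges that are evenly spaced in log10 units, so each bin is a fixed percentage wider than the one before it. The sketch below rebuilds the edges with the same step and checks that property with `np.histogram`. The loan amounts in `toy_amounts` are made up, and `max_amount` assumes the 35,000 maximum mentioned in the text.

```python
import numpy as np

log_binsize = 0.025
max_amount = 35000  # assumed maximum loan amount, taken from the text

# Edges are evenly spaced in log10 units, starting at 10**3 = 1,000
edges = 10 ** np.arange(3.0, np.log10(max_amount) + log_binsize, log_binsize)

# Each edge is the previous one multiplied by the same factor, 10**log_binsize
ratios = edges[1:] / edges[:-1]

# Made-up loan amounts, counted into the log-spaced bins
toy_amounts = np.array([1000, 2000, 4000, 4000, 10000, 15000, 35000])
counts, _ = np.histogram(toy_amounts, bins=edges)

print('number of bins:', len(edges) - 1)
print('growth factor per bin:', ratios[0])
print('values counted:', counts.sum())
```

Any amount below 1,000 would fall outside these edges, so the starting exponent should be lowered if the data contains smaller loans.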
The two charts show that loan amounts range from 1,000 to 35,000, with 4,000 being the most common, followed by 15,000 and 10,000. More than 6,000 people also borrowed 2,000, 4,000 and 5,000.
Now to the Monthly Loan Payment of the borrowers
binsize = 20
bins = np.arange(0, prosper['MonthlyLoanPayment'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = prosper, x = 'MonthlyLoanPayment', bins = bins)
plt.xlabel('Monthly Payment ($)')
plt.show()
To get a clearer picture, let's check how many people pay more than $1500 monthly
prosper[prosper['MonthlyLoanPayment'] > 1500].count()
Only 22 of the more than 110,000 borrowers pay more than $1,500 a month. I will drop them, as they are unlikely to contribute to this analysis, and then re-plot the histogram.
# drop rows with MonthlyLoanPayment > 1500
# keep payments up to and including 1500, matching the "> 1500" check above
prosper = prosper[prosper['MonthlyLoanPayment'] <= 1500]
binsize = 20
bins = np.arange(0, prosper['MonthlyLoanPayment'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = prosper, x = 'MonthlyLoanPayment', bins = bins)
plt.xlabel('Monthly Payment ($)')
plt.show()
The distribution is multimodal, with the highest peak at 200 and smaller peaks at 150 and 350.
The distribution of stated monthly income is next
bins_smi = np.arange(0, 50000, 200)
plt.hist(data = prosper, x = 'StatedMonthlyIncome', bins=bins_smi);
The distribution is unimodal and right-skewed, with most values between 0 and 20,000.
# Distribution of stated monthly income ranging from 0 to 20000
bins_smi = np.arange(0, 20000, 200)
plt.hist(data = prosper, x = 'StatedMonthlyIncome', bins=bins_smi);
This shows that the peak is at about 4,000. Stated monthly incomes above 30,000 are outliers, so they need to be dropped.
# check for StatedMonthlyIncome greater than $30000
prosper[prosper['StatedMonthlyIncome']>30000].count()
# drop the rows in which StatedMonthlyIncome is greater than $30000
prosper = prosper[prosper['StatedMonthlyIncome'] <= 30000]
Now to the distribution of the Month, Quarter, Year and the state of the borrowers
fig, ax = plt.subplots(nrows=4, figsize = [12,15])
default_color = sb.color_palette()[0]
sb.countplot(data = prosper, x = 'Month', color = default_color, ax = ax[0])
sb.countplot(data = prosper, x = 'Quarter', color = default_color, ax = ax[1])
sb.countplot(data = prosper, x = 'Year', color = default_color, ax = ax[2])
sb.countplot(data = prosper, x = 'BorrowerState', color = default_color, ax = ax[3])
plt.xticks(rotation = 90)
plt.show()
January recorded the highest number of loans of any month, while the fourth quarter leads among the quarters. I guess this might be due to the end-of-year festivities. Also, California leads in BorrowerState, followed by Florida, New York and Texas, while Maine and Wyoming have the fewest loans.
fig, ax = plt.subplots(nrows=2, figsize = [15,12])
default_color = sb.color_palette()[0]
sb.countplot(data = prosper, x = 'Term', color = default_color, ax = ax[0])
sb.countplot(data = prosper, x = 'Occupation', color = default_color, ax = ax[1])
plt.xticks(rotation = 90)
plt.show()
More than 25% of the borrowers selected Other as their occupation. Technical school students, judges, community college students and college freshmen have the fewest loans.
fig, ax = plt.subplots(nrows=3, figsize = [10,10])
default_color = sb.color_palette()[0]
sb.countplot(data = prosper, x = 'CreditScore', color = default_color, ax = ax[0])
sb.countplot(data = prosper, x = 'EmploymentStatus', color = default_color, ax = ax[1])
sb.countplot(data = prosper, x = 'ListingCategory', color = default_color, ax = ax[2])
plt.xticks(rotation = 90)
plt.show()
About 50% of the borrowers have a good credit score, 70% are employed or have a full-time job, and 50% of the loans are for debt consolidation.
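Percentages like these can be computed directly instead of being estimated from the bar heights. This is a small sketch on a made-up column of grades, so the numbers it prints are not the dataset's.

```python
import pandas as pd

# Made-up grades, only to show the calculation
toy = pd.Series(['Good', 'Good', 'Fair', 'Very good',
                 'Good', 'Fair', 'Excellent', 'Good'], name='CreditScore')

# normalize=True returns fractions of the total; multiply by 100 for percentages
shares = toy.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Applying the same call to `prosper['CreditScore']`, `prosper['EmploymentStatus']` and `prosper['ListingCategory']` would give the exact shares behind the three bar charts.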
# distribution of DebtToIncomeRatio,
bins = np.arange(0,prosper['DebtToIncomeRatio'].max()+0.01, 0.01)
plt.figure(figsize=[8, 5])
plt.hist(data = prosper, x = 'DebtToIncomeRatio', bins = bins)
plt.xlabel('DebtToIncomeRatio');
prosper[prosper['DebtToIncomeRatio'] > 1.0].count()
Most of the values are between 0 and 1. Only 795 rows are above 1, which amounts to 0.71% of the dataset.
# drop rows with DebtToIncomeRatio > 1.0
prosper = prosper[prosper['DebtToIncomeRatio'] <= 1.0]
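Before dropping rows above a cut-off, it helps to state both the count and the share of the data they represent. The helper below, `share_above`, is a hypothetical function written for this illustration, and the values in `toy` are made up.

```python
import pandas as pd

def share_above(values, threshold):
    """Return (count, percentage) of entries strictly greater than threshold."""
    values = pd.Series(values)
    mask = values > threshold
    # The mean of a boolean mask is the fraction of True entries
    return int(mask.sum()), 100 * mask.mean()

# Made-up debt-to-income ratios; two of the eight exceed 1.0
toy = [0.1, 0.25, 0.4, 1.5, 0.3, 10.01, 0.2, 0.9]
count, pct = share_above(toy, 1.0)
print(f'{count} rows ({pct:.2f}%) are above the threshold')
```

The same helper could be reused for the MonthlyLoanPayment and StatedMonthlyIncome cut-offs used earlier.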
# re-plot distribution of DebtToIncomeRatio,
bins = np.arange(0,prosper['DebtToIncomeRatio'].max()+0.01, 0.01)
plt.figure(figsize=[8, 5])
plt.hist(data = prosper, x = 'DebtToIncomeRatio', bins = bins)
plt.xlabel('DebtToIncomeRatio');
The distribution is unimodal and right-skewed, with a high spike at about 0.25 that holds roughly 10% of the borrowers. Part of this spike may come from the missing values that were filled with the column mean earlier.
bins = np.arange(0, prosper.BorrowerRate.max()+0.05, 0.01)
plt.figure(figsize=[8, 5])
plt.hist(data = prosper, x = 'BorrowerRate', bins = bins);
plt.xlabel('Borrower Rate');
bins = np.arange(0, prosper.BorrowerAPR.max()+0.05, 0.01)
plt.figure(figsize=[8, 5])
plt.hist(data = prosper, x = 'BorrowerAPR', bins = bins);
plt.xlabel('Borrower APR');
The distribution looks multimodal, with a small peak centered at 0.1, a large peak centered at 0.2 and another small peak centered at 0.3. Additionally, there is a very sharp peak between 0.35 and 0.36. Only very few loans have an APR greater than 0.4.
The distribution of borrower APR looks multimodal, and most of the values are in the range of 0.05 to 0.4. The same goes for the borrower rate, where most of the values are between 0.05 and 0.35. There are no unusual points and no need to perform any transformations.
Also, the distribution of stated monthly income is highly right-skewed. DebtToIncomeRatio is also right-skewed. There is no need to perform any transformations.
22 out of over 110,000 borrowers have a MonthlyLoanPayment of over 1,500, so I dropped them.
Stated monthly income has a few values above 30,000. These were outliers, so they were dropped.
795 rows have a DebtToIncomeRatio greater than 1. Since this amounts to only 0.71% of the dataset, dropping them is unlikely to affect the analysis. I also assume that any conclusion reached for values between 0 and 1 extends to values above 1.
In this section, I will investigate relationships between pairs of variables in the data, using the variables introduced in the univariate exploration above.
num_vars1 = ['LoanOriginalAmount', 'DebtToIncomeRatio', 'StatedMonthlyIncome', 'MonthlyLoanPayment', 'BorrowerAPR',
'BorrowerRate']
num_vars2 = ['LoanOriginalAmount', 'Recommendations', 'InvestmentFromFriendsAmount', 'InvestmentFromFriendsCount', 'Investors']
cat_vars1 = ['Term', 'Quarter', 'CreditScore', 'EmploymentStatus']
cat_vars2 = ['ListingCategory', 'IsBorrowerHomeowner', 'BorrowerState', 'Year']
plt.figure(figsize = [8, 5])
sb.heatmap(prosper[num_vars1].corr(), annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
plt.show()
Loan amount does not seem to depend on the debt-to-income ratio, as the correlation is almost zero. The positive correlation between loan amount and monthly income was as expected. Also, debt-to-income ratio and monthly income have a negative correlation.
Interestingly, there is an almost perfect correlation between loan amount and monthly loan payment. Borrower APR and rate both have a negative correlation with loan amount, and the two coefficients are almost the same. Since the correlation coefficient between APR and rate is 0.99, I will use them interchangeably.
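One plausible reason for the very strong correlation between loan amount and monthly loan payment is that, for a fixed-rate loan repaid in equal installments, the payment is directly proportional to the amount borrowed. The sketch below uses the standard annuity formula, M = P * r / (1 - (1 + r)^(-n)), with r as the monthly rate and n as the number of months. Whether Prosper computes MonthlyLoanPayment exactly this way is an assumption, and `monthly_payment` is a hypothetical helper.

```python
def monthly_payment(principal, annual_rate, term_months):
    """Payment on a fixed-rate, fully amortizing loan (standard annuity formula).

    Assumes the annual rate is compounded monthly; this is an illustration,
    not a statement of how the dataset's MonthlyLoanPayment was computed.
    """
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        # With no interest the principal is simply split evenly
        return principal / term_months
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -term_months)

# Same rate and term, double the principal
p1 = monthly_payment(10000, 0.20, 36)
p2 = monthly_payment(20000, 0.20, 36)
print(round(p1, 2), round(p2, 2), round(p2 / p1, 6))
```

At a fixed rate and term, doubling the principal doubles the payment, so the remaining scatter between the two columns would come from differences in rate and term across loans.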
plt.figure(figsize = [8, 5])
sb.heatmap(prosper[num_vars2].corr(), annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
plt.show()
As expected, there is a positive correlation between Recommendations and investment from friends. Most variables in num_vars2 show almost no correlation with loan amount, except for Investors.
def boxgrid(x, y, **kwargs):
    """ Quick hack for creating box plots with seaborn's PairGrid. """
    default_color = sb.color_palette()[0]
    # pass x and y by keyword; recent seaborn versions no longer accept them positionally
    sb.boxplot(x = x, y = y, color = default_color)

# PairGrid creates its own figure; `height` replaces the older `size` argument
g = sb.PairGrid(data = prosper, y_vars = num_vars1, x_vars = ['Occupation'],
                height = 10, aspect = 1.5)
g.map(boxgrid);
plt.xticks(rotation=90);
Junior college students have the lowest loan amounts while pharmacists have the highest.
Doctors have the highest income while college sophomores have the lowest.
Pharmacists have the highest monthly loan payments while junior college students and college sophomores have the lowest.
The debt-to-income ratio is highest for Teacher's Aides.
# plot matrix of numeric features against categorical features.
def boxgrid(x, y, **kwargs):
    """ Quick hack for creating box plots with seaborn's PairGrid. """
    default_color = sb.color_palette()[0]
    # pass x and y by keyword; recent seaborn versions no longer accept them positionally
    sb.boxplot(x = x, y = y, color = default_color)

# PairGrid creates its own figure; `height` replaces the older `size` argument
g = sb.PairGrid(data = prosper, y_vars = num_vars1,
                x_vars = cat_vars1, height = 5, aspect = 1.5)
g.map(boxgrid);
plt.xticks(rotation=30);
The loan amount increases with the term. The first quarter has the highest loan amounts, followed by the fourth quarter, while the remaining two are almost equal. Loan amount increases with better credit score. Self-employed and employed borrowers lead in loan amount, while part-time borrowers come last.
Stated monthly income and monthly loan payment both increase with better credit score. As expected, unemployed borrowers have the lowest monthly income, followed by borrowers with part-time jobs, while self-employed and employed borrowers top the chart. Borrowers with part-time jobs have the lowest monthly loan payment.
Borrower APR and rate drop with better credit score. Unemployed borrowers have the highest APR, while those with part-time jobs have the lowest.
# PairGrid creates its own figure; `height` replaces the older `size` argument
g = sb.PairGrid(data = prosper, y_vars = num_vars1,
                x_vars = ['ListingCategory'], height = 5, aspect = 1.5)
g.map(boxgrid);
plt.xticks(rotation=90);
From the above figure, debt consolidation and baby & adoption top the chart for loan amount, followed by business and wedding loans.
# comparing the categorical variables
plt.figure(figsize = [15, 30])
# subplot 1: EmploymentStatus vs CreditScore
plt.subplot(4, 1, 1)
sb.countplot(data = prosper, x = 'EmploymentStatus', hue = 'CreditScore', palette = 'Blues')
# subplot 2: CreditScore vs Term
plt.subplot(4, 1, 2)
sb.countplot(data = prosper, x = 'CreditScore', hue = 'Term', palette = 'Greens')
# subplot 3: CreditScore vs IsBorrowerHomeowner
plt.subplot(4, 1, 3)
sb.countplot(data = prosper, x = 'CreditScore', hue = 'IsBorrowerHomeowner', palette = 'Greens')
# subplot 4: ListingCategory vs. CreditScore
ax = plt.subplot(4, 1, 4)
sb.countplot(data = prosper, x = 'ListingCategory', hue = 'CreditScore', palette = 'Blues')
ax.legend(loc = 1, ncol = 2) # re-arrange legend to remove overlapping
plt.xticks(rotation = 90)
plt.show()
Most employed borrowers have at least a good credit score. Homeowners are more common among borrowers with higher credit scores.
# plot the categorical variables against LoanOriginalAmount, DebtToIncomeRatio and StatedMonthlyIncome again using violin plot
fig, ax = plt.subplots(ncols = 3, nrows = 4 , figsize = [15,15])
for i in range(len(cat_vars1)):
    var = cat_vars1[i]
    sb.violinplot(data = prosper, x = var, y = 'LoanOriginalAmount', ax = ax[i,0],
                  color = default_color)
    sb.violinplot(data = prosper, x = var, y = 'DebtToIncomeRatio', ax = ax[i,1],
                  color = default_color)
    sb.violinplot(data = prosper, x = var, y = 'StatedMonthlyIncome', ax = ax[i,2],
                  color = default_color)
plt.xticks(rotation = 60)
plt.show()
The plot shows that larger loans tend to come with longer terms. Also, borrowers with higher credit scores have access to larger loan amounts. Self-employed and employed borrowers have access to larger loans, while part-time, retired and unemployed borrowers mostly take smaller loans. And as expected, unemployed borrowers have the lowest monthly income.
Loan amount seems to be independent of the debt-to-income ratio, but has a positive correlation with monthly income, as expected. It also has a high positive correlation with monthly loan payment. APR and borrower rate have a correlation of 0.99, and they both have a negative correlation with loan amount.
The higher the loan amount, the longer the term of payment. Loan amount also increases with better credit score. Self-employed and employed borrowers have access to larger loans, while part-time, retired and unemployed borrowers mostly take smaller loans. Debt consolidation and baby & adoption top the chart for loan amount, followed by business and wedding loans.
Borrowers with better credit scores have higher stated monthly income as well as higher monthly loan payments. Borrower APR and rate also drop with better credit score. Most employed borrowers have at least a good credit score. Among borrowers with at least a good credit score, homeowners outnumber non-homeowners.
The higher the monthly income, the lower the debt-to-income ratio. Doctors have the highest income, while pharmacists take the largest loans and have the highest monthly payments. The smallest loans go to junior college students, while college sophomores have the lowest monthly income. Unemployed borrowers have the highest APR.
In this section, I will create plots of three or more variables to investigate the Prosper loan data even further, building on the findings from the previous sections.
# Term effect on relationship of APR and loan amount
# `height` replaces the older `size` argument of FacetGrid
g=sb.FacetGrid(data= prosper, aspect=1.2, height=10, col='Term', col_wrap=4)
g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.1});
g.add_legend();
Irrespective of the term of the loan, APR maintains a negative slope with loan amount.
# CreditScore effect on relationship of monthly income and monthly loan payment
g=sb.FacetGrid(data= prosper, aspect=1.2, height=10, col='CreditScore', col_wrap=4)
g.map(sb.regplot, 'StatedMonthlyIncome', 'MonthlyLoanPayment', x_jitter=0.04, scatter_kws={'alpha':0.1});
g.add_legend();
Monthly income and monthly loan payment maintain positive slopes for all credit scores, although the relationship is stronger for Good, Very good and Excellent credit scores.
# CreditScore effect on relationship of APR and loan amount
g=sb.FacetGrid(data=prosper, aspect=1.2, height=5, col='CreditScore', col_wrap=4)
g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.1});
g.add_legend();
The relationship between loan amount and APR changes from negative to slightly positive if the borrower has an excellent credit score.
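The sign change described above can be checked numerically by fitting a straight line within each credit score group, which is the same kind of least-squares line that `regplot` draws by default. The frame below is made up so that one group slopes down and the other up. The helper `fitted_slope` is hypothetical, and the numbers are not results from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: APR falls with amount for 'Good', rises slightly for 'Excellent'
toy = pd.DataFrame({
    'CreditScore': ['Good'] * 4 + ['Excellent'] * 4,
    'LoanOriginalAmount': [2000, 5000, 10000, 15000] * 2,
    'BorrowerAPR': [0.30, 0.27, 0.22, 0.17, 0.08, 0.09, 0.10, 0.11],
})

def fitted_slope(group):
    """Slope of the least-squares line of BorrowerAPR on LoanOriginalAmount."""
    slope, _intercept = np.polyfit(group['LoanOriginalAmount'],
                                   group['BorrowerAPR'], deg=1)
    return slope

# One slope per credit score group
slopes = {name: fitted_slope(group) for name, group in toy.groupby('CreditScore')}
for name, slope in slopes.items():
    print(f'{name}: {slope:.2e} APR per dollar')
```

Comparing the sign of the slope across groups gives a compact confirmation of what the faceted scatter plots suggest.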
# EmploymentStatus effect on relationship of monthly income and loan amount
g=sb.FacetGrid(data=prosper, aspect=1.2, height=5, col='EmploymentStatus', col_wrap=4)
g.map(sb.regplot, 'StatedMonthlyIncome', 'LoanOriginalAmount', x_jitter=0.04, scatter_kws={'alpha':0.1});
g.add_legend();
All employment statuses show a positive slope between loan amount and monthly income except 'Not employed', which shows a slightly negative relationship. Employed and Other have a very strong positive relationship.
fig = plt.figure(figsize = [8,6])
ax = sb.pointplot(data = prosper, x = 'EmploymentStatus', y = 'BorrowerAPR', hue = 'CreditScore',
palette = 'Blues', linestyles = '', dodge = 0.4, ci='sd')
plt.title('Borrower APR across CreditScore and Status')
plt.ylabel('Mean Borrower APR')
ax.legend(loc = 1, ncol = 2)
ax.set_yticklabels([],minor = True)
plt.xticks(rotation = 90);
APR drops as the credit score improves for all employment statuses. Borrowers whose employment status is 'Not available' have the lowest APR. 'Not employed' borrowers have the highest APR in every credit score group except Poor, where retired borrowers have the highest.
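The point plot summarizes each EmploymentStatus and CreditScore combination by its mean and spread, and the same summary can be tabulated for exact comparison. This is a sketch on a made-up frame. Note that pandas computes the sample standard deviation, which may differ slightly from the spread seaborn draws depending on its version.

```python
import pandas as pd

# Made-up APR values for a few status / score combinations
toy = pd.DataFrame({
    'EmploymentStatus': ['Employed', 'Employed', 'Employed', 'Employed',
                         'Retired', 'Retired'],
    'CreditScore': ['Good', 'Good', 'Excellent', 'Excellent', 'Good', 'Good'],
    'BorrowerAPR': [0.20, 0.24, 0.10, 0.12, 0.25, 0.29],
})

# Mean and standard deviation of APR for every status / score pair
summary = (toy.groupby(['EmploymentStatus', 'CreditScore'])['BorrowerAPR']
              .agg(['mean', 'std'])
              .round(4))
print(summary)
```

Sorting this table by the mean column makes it easy to read off which combination has the highest or lowest APR.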
fig = plt.figure(figsize = [8,6])
ax = sb.pointplot(data = prosper, x = 'EmploymentStatus', y = 'StatedMonthlyIncome', hue = 'CreditScore',
palette = 'Greens', linestyles = '', dodge = 0.4, ci='sd')
plt.title('Monthly Income across Credit Score and Status')
plt.ylabel('Mean Income')
ax.set_yticklabels([],minor = True);
ax.legend(ncol = 2)
plt.xticks(rotation = 90);
StatedMonthlyIncome increases with better credit score across all employment statuses except 'Not employed', where the pattern fluctuates. Also, it seems most borrowers who chose not to share their employment status earn well.
Borrowers with higher credit score seem to earn more and pay more loans per month.
APR drops as the credit score improves for all employment statuses. 'Not employed' borrowers have the highest APR except among those with a Poor credit score, where retired borrowers have the highest APR. StatedMonthlyIncome increases with better credit score across all employment statuses, although this does not seem to hold for 'Not employed', where the pattern fluctuates. Also, it seems most borrowers who chose not to share their employment status earn well.
The higher the APR, the lower the loan amount and vice versa. This only changes when the borrower has a credit score of 800 and above.
All employment statuses show a positive slope between loan amount and monthly income except 'Not employed', which shows a slightly negative relationship. Employed and Other have a very strong positive relationship.
DebtToIncomeRatio seems to have little or no effect on both BorrowerAPR and LoanOriginalAmount.
In conclusion, some of the major determinants of loan amount seem to be the borrower's credit score, APR and rate.
While a higher credit score goes with a higher loan amount, the reverse is the case for both borrower APR and rate.
Credit score seems to depend largely on both employment status and monthly income: self-employed and employed/full-time borrowers have better credit scores, and borrowers with higher monthly income have higher credit scores.
# save the cleaned dataset for explanatory data analysis
prosper.to_csv('prosperLoan_cleaned.csv', index=False)
# We save our data in a pickle format not to lose pandas types definitions in a text csv file.
prosper.to_pickle("./prosperLoan_cleaned.pkl")
# Read clean data
# prosper = pd.read_csv('prosperLoan_cleaned.csv')
prosper = pd.read_pickle("./prosperLoan_cleaned.pkl")
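The reason for keeping a pickle next to the CSV can be shown on a small example: a CSV round trip turns an ordered categorical column back into plain text, while a pickle round trip keeps the dtype. The frame and the file name `toy.pkl` below are made up for the illustration.

```python
import io
import os
import tempfile

import pandas as pd

# A tiny frame with one ordered categorical column
grade_type = pd.api.types.CategoricalDtype(categories=['Poor', 'Fair', 'Good'],
                                           ordered=True)
toy = pd.DataFrame({'CreditScore': pd.Series(['Good', 'Poor', 'Fair']).astype(grade_type)})

# CSV round trip (in memory): the column comes back as plain strings
csv_copy = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# Pickle round trip (temporary file): the categorical dtype survives
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pkl')
    toy.to_pickle(path)
    pickle_copy = pd.read_pickle(path)

csv_is_categorical = isinstance(csv_copy['CreditScore'].dtype, pd.CategoricalDtype)
pickle_is_categorical = isinstance(pickle_copy['CreditScore'].dtype, pd.CategoricalDtype)
print('categorical after CSV:', csv_is_categorical)
print('categorical after pickle:', pickle_is_categorical)
```

If only the CSV were available, the ordered types would have to be re-applied after loading, for example by repeating the conversion loop used during cleaning.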